{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Split by HTML header\n",
    " HTMLHeaderTextSplitter is a \"structure-aware\" chunker that splits text at the element level and adds metadata for each header \"relevant\" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.\n",
    " \n",
    "HTMLHeaderTextSplitter 是一个 “结构感知”的分块器，它在元素级别分割文本，并将每个“相关”的标题的元数据添加到任何给定的块中。它可以逐个元素返回块，也可以将具有相同元数据的元素组合在一起，其目标是 （a） 保持相关文本在语义上（或多或少）分组，以及 （b） 保留文档结构中编码的上下文丰富的信息。它可以与其他文本拆分器一起使用，作为分块管道的一部分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Foo'),\n",
       " Document(page_content='Some intro text about Foo.  \\nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),\n",
       " Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),\n",
       " Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),\n",
       " Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),\n",
       " Document(page_content='Baz', metadata={'Header 1': 'Foo'}),\n",
       " Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),\n",
       " Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# %pip install -qU langchain-text-splitters\n",
    "from langchain_text_splitters import HTMLHeaderTextSplitter\n",
    "\n",
    "html_string = \"\"\"\n",
    "<!DOCTYPE html>\n",
    "<html>\n",
    "<body>\n",
    "    <div>\n",
    "        <h1>Foo</h1>\n",
    "        <p>Some intro text about Foo.</p>\n",
    "        <div>\n",
    "            <h2>Bar main section</h2>\n",
    "            <p>Some intro text about Bar.</p>\n",
    "            <h3>Bar subsection 1</h3>\n",
    "            <p>Some text about the first subtopic of Bar.</p>\n",
    "            <h3>Bar subsection 2</h3>\n",
    "            <p>Some text about the second subtopic of Bar.</p>\n",
    "        </div>\n",
    "        <div>\n",
    "            <h2>Baz</h2>\n",
    "            <p>Some text about Baz</p>\n",
    "        </div>\n",
    "        <br>\n",
    "        <p>Some concluding text about Foo</p>\n",
    "    </div>\n",
    "</body>\n",
    "</html>\n",
    "\"\"\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"h1\", \"Header 1\"),\n",
    "    (\"h2\", \"Header 2\"),\n",
    "    (\"h3\", \"Header 3\"),\n",
    "]\n",
    "\n",
    "html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "html_header_splits = html_splitter.split_text(html_string)\n",
    "html_header_splits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
       " Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
       " Document(page_content='This account of Gödel’s discovery was told to Hao Wang very much after the fact; but in Gödel’s contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
       " Document(page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel’s publication of that theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),\n",
       " Document(page_content='We now describe the proof of the two theorems, formulating Gödel’s results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel’s notation.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'})]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "url = \"https://plato.stanford.edu/entries/goedel/\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"h1\", \"Header 1\"),\n",
    "    (\"h2\", \"Header 2\"),\n",
    "    (\"h3\", \"Header 3\"),\n",
    "    (\"h4\", \"Header 4\"),\n",
    "]\n",
    "\n",
    "html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "\n",
    "# for local file use html_splitter.split_text_from_file(<path_to_file>)\n",
    "html_header_splits = html_splitter.split_text_from_url(url)\n",
    "\n",
    "chunk_size = 500\n",
    "chunk_overlap = 30\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size, chunk_overlap=chunk_overlap\n",
    ")\n",
    "\n",
    "# Split\n",
    "splits = text_splitter.split_documents(html_header_splits)\n",
    "splits[80:85]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain0_1",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
